Benchmarking and integration of methods for deconvoluting spatial transcriptomic data

Abstract Motivation The rapid development of spatial transcriptomics (ST) approaches has provided new insights into understanding tissue architecture and function. However, the gene expressions measured at a spot may contain contributions from multiple cells due to the low-resolution of current ST technologies. Although many computational methods have been developed to disentangle discrete cell types from spatial mixtures, the community lacks a thorough evaluation of the performance of those deconvolution methods. Results Here, we present a comprehensive benchmarking of 14 deconvolution methods on four datasets. Furthermore, we investigate the robustness of different methods to sequencing depth, spot size and the choice of normalization. Moreover, we propose a new ensemble learning-based deconvolution method (EnDecon) by integrating multiple individual methods for more accurate deconvolution. The major new findings include: (i) cell2loction, RCTD and spatialDWLS are more accurate than other ST deconvolution methods, based on the evaluation of three metrics: RMSE, PCC and JSD; (ii) cell2location and spatialDWLS are more robust to the variation of sequencing depth than RCTD; (iii) the accuracy of the existing methods tends to decrease as the spot size becomes smaller; (iv) most deconvolution methods perform best when they normalize ST data using the method described in their original papers; and (v) the integrative method, EnDecon, could achieve more accurate ST deconvolution. Our study provides valuable information and guideline for practically applying ST deconvolution tools and developing new and more effective methods. Availability and implementation The benchmarking pipeline is available at https://github.com/SunXQlab/ST-deconvoulution. An R package for EnDecon is available at https://github.com/SunXQlab/EnDecon. Supplementary information Supplementary data are available at Bioinformatics online.

 cell2location. It estimates the cell type signature from scRNA-seq data using a model based on negative binomial regression and then uses this reference signature to estimate the abundance of each cell type in each spot (Kleshchevnikov, et al., 2020).
 stereoscope. It utilizes a probabilistic model to decompose cell type mixtures in spatial data, assuming that both spatial and scRNA-seq data follow a negative binomial distribution (Andersson, et al., 2020).
 STRIDE. This method leverages latent Dirichlet allocation (LDA), a generative probabilistic model, trained from scRNA-seq data to decompose spot expression into individual cell types (Sun, et al., 2022).
 DestVI. It uses a probabilistic method constructed by variational inference and latent variable models to obtain the cell type decomposition at each spot (Lopez, et al., 2021).
 STdeconvolve. This method is built on an LDA model to infer the proportional representation of cell types in each multi-cellular spot, without relying on external single-cell reference (Miller, et al., 2022). So, it is a reference-free deconvolution method for ST deconvolution, taking into account of limited availability of suitable single-cell data as a reference due to technical and other reasons.

Deep learning-based deconvolution methods
 DSTG. It deconvolutes ST data by leveraging graph convolutional networks trained from pseudo-ST data, where pseudo-ST data is generated by randomly combining expression data of scRNA-seq cells (Song and Su, 2021).
 Tangram. It learns a spatial alignment for scRNA-seq data by using a deep learning framework and non-convex optimization, which allows decomposing ST data by assigning cells from scRNA data to spatial locations (Biancalani, et al., 2021).

Text S2 Generation of datasets with different sequencing depths
To test the robustness of different deconvolution methods concerning sequencing depth on three synthetic ST datasets, we adopted the 'downsampleMatrix' function in the DropletUtils package (https://www.bioconductor.org/packages/release/bioc/html/DropletUtils.html) to down-sample the spatial expression matrix to generate datasets with different sequencing depths. Specifically, the relatively lower number of genes captured by MERFISH and the lower counts per gene sequenced by sci-Space resulted in low resolution of the gene expression in these two synthetic ST datasets. Therefore, we down-sampled these datasets by setting the down-sampling rate = 1, 0.8, 0.6, 0.2, where the down-sampling rate = 1 implies the raw counts. For the mouse brain (mapped sc-ST) dataset, the total counts per spot consisted of the counts of multiple scRNA-seq cells, we then set the down-sampling rate = 1, 0.5, 0.2, 0.05, 0.01. Notably, the sequencing depth of the simulated dataset generated by the down-sampling rate = 0.01 was comparable to that of the real ST data. Therefore, subsequent experiments on the mouse brain (mapped sc-ST) dataset were performed based on simulated data with the down-sampling rate = 0.01. Table S1 lists details of the counts corresponding to different down-sampling rates on the three simulated datasets.

Text S3 Generation of datasets with different spot sizes
To investigate the impact of different spot sizes on the performance of different deconvolution methods, we defined squares with different areas to adjust the spot size, i.e., the number of cells contained in a spot. Since the square size of the embryo (sci-Space) dataset was defined during sequencing and was not adjustable during the simulation process, we did not test the impact of different spot sizes on this dataset. For MPOA (MERFISH) and mouse brain (mapped sc-ST) datasets, we generated simulated datasets with three different spot sizes, 25 μm, 55 μm, and 150 μm in diameter, respectively.

Text S4 Data normalization choices
To evaluate the impact of different normalization methods on the performance of deconvolution methods, we processed ST data using four normalization methods: transcripts per million (TPM), normalizeGiotto (Dries, et al., 2021), SCTransform (Stuart, et al., 2019), and unit variance. We collected the normalization methods used in the current ST deconvolution methods. Since some methods require the input to be the original count matrix, we only performed the normalization benchmark on those without restrictions on the input format of ST data. Meanwhile, the normalization of the reference scRNA-seq data was kept consistent with the original requirements of the ST deconvolution methods.

cell2location.
We followed the guidelines provided on the cell2location website: https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html. We trained the single-cell model on the reference data with parameters max_epochs = 250 and lr = 0.002. The cell2location model was obtained on the ST data with parameters max_epochs = 30000.
spatialDecon. We followed the guidelines provided on the spatialDecon website: https://github.com/Nanostring-Biostats/SpatialDecon. We first created a profile matrix with reference data using the 'creat_proflie_matrix' function with the parameter minGenes = 0. The deconvolution of ST data was employed by using the 'runspatialdecon' function while setting parameter bg as 0.01.
STdeconvolve. We followed the commands described on the STdeconvolve website: https://github.com/JEFworks-Lab/STdeconvolve. The 'restrictCorpus' function was used for selecting genes with parameters removeAbove =1.0, removeBelow = 0.05. The STdeconvolve model was fitted by the 'fitLDA' function and then the 'optimalLDA' function was used to select the optimal topic with the parameter opt = number of cell types.
We first subsampled the reference data by setting the lower and upper bounds of 25 and 250 cells for each cell type, respectively. The reference model and spatial model were trained with the same parameter max_epoch = 75000.

Text S6 Evaluation metrics
We calculated the following metrics to evaluate the performance of different deconvolution methods by using the known cell type proportions in the simulated datasets as the ground truth.
Assume matrices × and × represent the predicted cell type proportions and the known cell type proportions, respectively, where represents the number of spots and represents the number of cell types. Accordingly, the vectors × and × represent the predicted cell type proportions and the known cell type proportions. Note that × = , , ⋯ , and × = [ , , ⋯ , ] .
(1) RMSE. The RMSE value between the predicted cell type proportion and the known cell type proportions was calculated as follows: where and are the -th element of the predicted cell type proportion vector × and the known cell type proportions vector × , respectively. Lower RMSE value corresponds to better performance of the deconvolution method.
(2) PCC. The PCC value was calculated as follows: where and represent the -th element of the predicted and known cell type proportion vectors, respectively; and are the mean proportions of the predicted and known cell type proportion vectors, respectively; and are the standard deviations of the predicted and known cell type proportion vectors, respectively. Higher PCC value corresponds to more accurate deconvolution.
(3) JSD. The JSD metric measures the similarity between two probability distributions. We calculated the JSD value of each spot as follows: where and represent the cell type proportions of spot in the predicted result and the ground truth, respectively.
Eq. (S4) calculates the Kullback-Leibler divergence between the two probability distributions and , where the predicted probability and expected probability of cell type in spot are represented by and , respectively.
The lower JSD value indicates that the precited cell type proportions are more similar to the ground truth cell type proportions.
(4) Running time and memory. For the deconvolution methods using R language, we used the 'memory_used' function from the pryr package to evaluate the memory changes and used the 'proc.time' function to measure the running time.
For the deconvolution methods using Python language, memory changes were calculated using the 'memory_full_info' function from the psutil package and the running time was assessed with the 'time' function from the time package.
(5) Score aggregation. To rank different methods, we aggregated different scores at two levels: across different datasets and different metrics. The aggregation procedure is explained in detail as follows: For each dataset, we first normalized the values of RMSE, PCC, and JSD (50% median) by using Eq. (S5-S7), respectively.
where = 1, 2, 3 represents the -th synthetic dataset, , , and represent the RMSE values, PCC values, and JSD (50% median) values of 14 methods on the -th dataset, respectively. Lower RMSE and JSD correspond to higher values of and , and higher PCC correspond to higher values of .
After normalization, the values of each of the above scores were aggregated across different datasets by calculating their arithmetic mean, as shown in Eq. (S8-S10). The aggregated scores of different metrics were then averaged for each method by using Eq. (S11), which was used as the ultimate overall score for ranking.
In addition, the usability of each method was scored using Eq. (S13), 1 = ( ) 2 usability time memory score score score  . (S13) where and for scoring time and memory are detailed in Table S1.         Table S2 lists details of down-sampling rates. Regarding embryo (sci-Space) and MPOA (MERFISH) datasets, cell2location, RCTD, and spatialDWLS consistently outperformed other deconvolution methods at different downsampling rates, and most of the deconvolution methods were robust to the variation of down-sampling rates. Meanwhile, Giotto/Hypergeometric, Giotto/rank, Giotto/PAGE, and spatialDWLS could not work on the simulated dataset with the down-sampling rate = 0.2, due to the high dropout ratio of this ST dataset. For the mouse brain (mapped sc-ST) dataset, cell2location, RCTD, and spatialDWLS performed best with the down-sampling rate = 0.01, where the sequencing depth was close to that of real ST data. However, the RMSE value of RCTD gradually became larger as the down-sampling rate increased. In addition, stereoscope became gradually worse when the down-sampling rate increased, while spatialDecon became gradually worse when the down-sampling rate decreased.

Fig. S9.
Comparing the sensitivity of different deconvolution methods to different spot sizes. The bar plot represents the RMSE between the expected proportion and the output proportions from the different deconvolution methods. Different colors represent different spot sizes. Since the spots' coordinates in the embryo (sci-Space) dataset were not continuous and thus could not be used to generate synthetic datasets with different spot sizes, we adopted the MPOA (MERFISH) and mouse brain (mapped sc-ST) datasets for evaluation (see Methods). The spot with larger size contains more cells. Notably, the RMSE values of the deconvolution methods tended to become larger as the spot size decreased from 150 μm to 25 μm (Fig. 4). This may be due to that smaller size of spot contains less cells so that the sample size for inferring cell type proportions becomes even small, leading to increased bias for deconvolution. Different colors represent different ST normalization methods. Because some deconvolution methods (e.g., cell2location, RCTD, spatialDecon, stereoscope, and STdeconvolve) explicitly require raw gene expression matrix as input and do not support normalization processing, we did not include these methods for investigation.